Data Quality Constraints

You can select data quality constraints according to your use case. The Lazsa Platform supports the following constraints:

Constraint Description Data Type to which it is applicable
Profiler Constraints  
ApproxCountDistinct Returns the approximate number of distinct values in a column. Numeric, string
Completeness Returns the fraction of non-null values in a column. For example, if a customer name is required, it checks whether the first name and last name are present for all records. If either of the two is missing, the record is incomplete. Numeric, string
DataType Returns the data type of the column, for example, boolean, fraction, integer, and so on. Numeric, string
Validity Assesses whether the data in a column adheres to the specified format or constraints, based on the provided regex. String
Count Checks distinct, filled and null counts. Numeric, string
Character Count Counts the values that contain numbers only, letters only, numbers and letters, and special characters. Numeric, string
Statistical Value Involves various statistical measures (e.g. minimum, maximum, mean, standard deviation) to describe the distribution of numeric data. Numeric
Recommendation Provides suggestions or recommendations based on the profiling results, indicating potential improvements or actions. Numeric, string
Analyzer Constraints  
ApproxCountDistinct Returns the approximate number of distinct values in a column. Numeric, string
Completeness Returns the fraction of non-null values in a column. For example, if a customer name is required, it checks whether the first name and last name are present for all records. If either of the two is missing, the record is incomplete. Numeric, string
Compliance Calculates the fraction of rows that match the given column constraint. Numeric, string
Correlation Calculates the Pearson correlation coefficient between the selected columns. Numeric
CountDistinct Returns the count of distinct elements in a column. Numeric, string
DataType Returns the data type of the column, for example, boolean, fraction, integer, and so on. Numeric, string
Distinctness Returns the fraction of distinct values in a column, that is, the number of distinct values divided by the total number of values. Numeric, string
Entropy Returns the measure of disorder contained in a message. Numeric, string
Maximum Returns the maximum value of a numeric column. Numeric
MaxLength Returns the maximum length of the values in a column with the string data type. String
Mean Returns the average value of a numeric column. Numeric
Minimum Returns the minimum value of a numeric column. Numeric
MinLength Returns the minimum length of the values in a column with the string data type. String
MutualInformation Returns how much information about one column can be inferred from another column. Numeric, string
PatternMatch Returns the fraction of rows whose values match the given regex pattern. String
Size Returns the size of data. N/A
StandardDeviation Shows the variation from the mean value of a column. Numeric
Sum Provides the sum of the column values. Numeric
UniqueValueRatio Returns the ratio of unique values (values that occur exactly once) to distinct values in a column. Numeric, string
Uniqueness Returns the ratio of unique values against all values of a column. Numeric, string
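
The ratio-style analyzer metrics above are easy to confuse, so the following is a minimal Python sketch of how they relate to one another. It is not the platform's implementation. The function name column_metrics is hypothetical, and two choices are assumptions made for illustration: the ratios are taken over non-null values, and entropy uses the natural logarithm.

```python
import math
from collections import Counter


def column_metrics(values):
    """Analyzer-style metrics for one column; None stands for a null value."""
    total = len(values)
    non_null = [v for v in values if v is not None]
    n = len(non_null)
    counts = Counter(non_null)

    distinct = len(counts)                               # values seen at least once
    unique = sum(1 for c in counts.values() if c == 1)   # values seen exactly once

    # Shannon entropy of the value distribution (natural log is an assumption).
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())

    return {
        "completeness": n / total if total else 0.0,
        "count_distinct": distinct,
        # Assumption: ratios use the non-null count as the denominator.
        "distinctness": distinct / n if n else 0.0,
        "uniqueness": unique / n if n else 0.0,
        "unique_value_ratio": unique / distinct if distinct else 0.0,
        "entropy": entropy,
    }


metrics = column_metrics(["a", "b", "b", "c", None])
# completeness = 4/5, distinctness = 3/4, uniqueness = 2/4, unique_value_ratio = 2/3
```

In this sample, "b" is distinct but not unique because it occurs twice, which is why distinctness and uniqueness differ.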
Validator Constraints  
hasSize Calculates the data frame size. N/A
isComplete Confirms whether a column is complete. Numeric, string
hasCompleteness Confirms whether a column is complete based on the historical completeness of the column. Numeric, string
isUnique Confirms whether a column is unique. Numeric, string
hasUniqueness Confirms whether a column or set of columns have uniqueness. Uniqueness is a fraction of unique values of a column. Numeric, string
hasDistinctness Confirms whether a column or set of columns have distinctness. Distinctness is a fraction of distinct values of a column. Numeric, string
hasUniqueValueRatio Confirms whether there is a unique value ratio in a column or set of columns. Numeric, string
hasEntropy Confirms whether a column has entropy. Entropy is a measure of disorder contained in a message. Numeric, string
hasMutualInformation Confirms whether two columns have mutual information. Mutual information measures how much information about one column can be inferred from the other column. Numeric, string
hasMinLength Confirms the minimum length of a column with string data type. String
hasMaxLength Confirms the maximum length of a column with string data type. String
hasMin Confirms the minimum of a column that contains a long, integer, or float data type. Numeric
hasMax Confirms the maximum of a column that contains a long, integer, or float data type. Numeric
hasSum Confirms the sum of the column. Numeric
hasMean Confirms the mean of the column. Numeric
hasStandardDeviation Confirms that the column has variation from the mean value. Numeric
hasApproxCountDistinct Confirms that the column has approximate distinct count. Numeric, string
hasCorrelation Confirms that there exists a Pearson correlation between two columns. Numeric
hasPattern Confirms whether the values of a column match the given regular expression. String
containsCreditCardNumber Checks and confirms whether the values of a column match a credit card number pattern. String
containsEmail Checks and confirms whether the values of a column match an email pattern. String
containsURL Checks and confirms whether the values of a column match a URL pattern. String
containsSocialSecurityNumber Checks and confirms whether the values of a column match the pattern of a US Social Security Number. String
isNonNegative Checks and confirms that a column does not contain any negative values. Numeric
isPositive Checks and confirms that every value in a column is greater than 0. Numeric
isLessThan Checks and confirms that in each row, the value of column A is less than the value of column B. Numeric
isLessThanOrEqualTo Checks and confirms that in each row, the value of column A is less than or equal to the value of column B. Numeric
isGreaterThan Checks and confirms that in each row, the value of column A is greater than the value of column B. Numeric
isGreaterThanOrEqualTo Checks and confirms that in each row, the value of column A is greater than or equal to the value of column B. Numeric
isContainedIn Checks and confirms that the value in a column is contained in a set of predefined values. Numeric
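
As a rough illustration of how the row-level validator constraints behave, the Python sketch below evaluates a few of them as predicates over sample rows. It is only a sketch under stated assumptions, not the platform's API: the helper fraction_satisfying, the column names, and the sample data are hypothetical, and a check is assumed to pass only when every row satisfies it. isLessThan is read here as column A strictly less than column B, as the name indicates.

```python
import re


def fraction_satisfying(rows, predicate):
    """Fraction of rows for which the predicate holds (1.0 for an empty input)."""
    if not rows:
        return 1.0
    return sum(1 for row in rows if predicate(row)) / len(rows)


# Hypothetical sample data; None stands for a null value.
rows = [
    {"sku": "ABC-1234", "net": 90.0, "gross": 100.0, "status": "shipped", "email": "a@example.com"},
    {"sku": "XYZ-0001", "net": 50.0, "gross": 50.0, "status": "open", "email": "b@example.com"},
    {"sku": "QRS-0042", "net": 55.0, "gross": 60.0, "status": "lost", "email": None},
]

# Each validator-style check is expressed as a per-row predicate.
checks = {
    "isComplete(email)": lambda r: r["email"] is not None,
    "isNonNegative(net)": lambda r: r["net"] >= 0,
    "isLessThan(net, gross)": lambda r: r["net"] < r["gross"],
    "isLessThanOrEqualTo(net, gross)": lambda r: r["net"] <= r["gross"],
    "isContainedIn(status)": lambda r: r["status"] in {"open", "shipped"},
    "hasPattern(sku)": lambda r: re.fullmatch(r"[A-Z]{3}-\d{4}", r["sku"]) is not None,
}

# Assumption: a check passes only if all rows satisfy its predicate.
results = {name: fraction_satisfying(rows, check) == 1.0 for name, check in checks.items()}
```

With this data, the second row (net equal to gross) fails isLessThan but satisfies isLessThanOrEqualTo, which shows the difference between the strict and non-strict comparisons.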
Issue Resolver Constraints  
Handle Duplicate Data Choose the column with a unique key to address duplicate entries. This operation can be applied to one or more columns. If duplicates exist, those records will be filtered out, and the filtered records can be stored in the rejected records path if the user opts for it. Numeric, string
Replace Selective Data Specify multiple values separated by commas to replace targeted values in the dataset. Numeric, string
Handle Missing Data Manage null or empty values by choosing to either fill or remove records. If the user opts to remove records, the discarded entries will be stored in the rejected records path. Numeric, string
Handle Outliers Input an integer value to address outliers. Based on the specified conditions, records that meet the criteria either have their column values replaced with the mean, maximum, or minimum value, or are dropped, depending on user preferences. Rejected records will be stored in the rejected records path. Numeric
Handle String Operations Execute selected string operations on specific columns, including trim, rtrim, lpad, rpad, substring, and regexp_replace. String
Handle Case Sensitivity Define whether to account for case sensitivity, including options for upper case, lower case, and proper case. String
Handle Data Against Master Table Conduct a lookup against the master table, allowing users to provide either static values or master table column names. Entries that do not match the master table or static values will be filtered out and saved in the rejected records path. Numeric, string
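
The Python sketch below illustrates the filter-and-reject flow described for Handle Duplicate Data and Handle Missing Data. It is a sketch under assumptions rather than the platform's implementation: the function names and sample rows are hypothetical, the first record for each key is assumed to be the one that is kept, and the rejected list stands in for the rejected records path.

```python
def handle_duplicates(rows, key_columns):
    """Keep the first row per key; route later duplicates to a rejected list."""
    seen = set()
    kept, rejected = [], []
    for row in rows:
        key = tuple(row[c] for c in key_columns)
        if key in seen:
            rejected.append(row)   # duplicate of an earlier record
        else:
            seen.add(key)
            kept.append(row)
    return kept, rejected


def handle_missing(rows, column, fill_value=None):
    """Fill null or empty values if fill_value is given; otherwise remove the rows."""
    kept, rejected = [], []
    for row in rows:
        if row[column] is None or row[column] == "":
            if fill_value is None:
                rejected.append(row)                     # removed record
            else:
                kept.append({**row, column: fill_value})  # filled copy
        else:
            kept.append(row)
    return kept, rejected


# Hypothetical sample data.
rows = [
    {"id": 1, "city": "Pune"},
    {"id": 2, "city": ""},
    {"id": 1, "city": "Pune"},
    {"id": 3, "city": None},
]

deduped, duplicate_rejects = handle_duplicates(rows, ["id"])
filled, _ = handle_missing(deduped, "city", fill_value="UNKNOWN")
removed, missing_rejects = handle_missing(deduped, "city")
```

Choosing to fill keeps all three deduplicated records, while choosing to remove keeps only the record with a city and sends the other two to the rejected list.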

 
